Exposing Instruction Level Parallelism in the Presence of Loops

Author

  • Marcos de Alba
Abstract

In this thesis we explore how a loop cache can relieve the unnecessary pressure that loops place on the trace cache. Because of their high temporal locality, loops are good candidates for caching. We have observed that when loops contain control flow instructions in their bodies, it is better to collect their traces in a dedicated loop cache than to consume trace cache space. The instruction traces within loops tend to exhibit predictable patterns that can be detected and exploited at run time. We propose to capture dynamic traces of loop bodies in a loop cache. The novelty of this loop cache lies in dynamically capturing loop iterations that contain conditional branches and correlating them with unique loops. Once loop iterations are cached, their bodies can be supplied by the loop cache without polluting the trace cache and without any instruction cache accesses. The proposed loop cache includes hardware capable of dynamically unfolding loops so that large traces of instructions are delivered in a single loop cache access. We evaluate our loop cache and compare it against a baseline machine with a larger first-level instruction cache. We also consider how the loop cache can complement a trace cache by filtering out the loop traces that would otherwise needlessly dominate trace cache space. We quantify the benefits provided by a fetch engine equipped with the proposed loop cache and unrolling hardware. In our experiments we explore the design space of the loop cache and its associated unfolding hardware and evaluate its ability to detect independent iterations of loops in SPECint2000, MediaBench and MiBench applications. We show that trace cache efficiency and ILP can be significantly improved using our loop caching scheme. This improvement translates into up to a 38% performance speedup when comparing a machine with a loop cache and no trace cache to a baseline machine with no loop cache. Further experiments show up to a 16% speedup for a hybrid machine with both a loop cache and a trace cache compared to a machine with a larger L1 instruction cache and a trace cache.
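The abstract describes the loop cache only at a behavioral level. As a rough software model of that behavior, not the thesis's actual hardware design, the fill/hit policy could be sketched as follows; the backward-taken-branch heuristic for loop detection, the per-iteration trace capture, and the `unroll` parameter are all assumptions made for illustration:

```python
# Hypothetical software model of a trace-building loop cache.
# Loop detection on a backward taken branch, per-iteration trace capture,
# and the unroll step are illustrative assumptions, not the thesis design.

class LoopCache:
    def __init__(self, max_loops=32, max_trace_len=256):
        self.max_loops = max_loops
        self.max_trace_len = max_trace_len
        self.entries = {}     # loop head PC -> list of cached iteration traces
        self.building = None  # (head PC, trace of the iteration being captured)

    def on_fetch(self, pc, insn, is_branch, taken, target):
        """Observe one dynamically fetched instruction."""
        if self.building is not None:
            head, trace = self.building
            if len(trace) >= self.max_trace_len:
                self.building = None          # body too large: give up on this loop
                return
            trace.append((pc, insn))
            # A taken branch back to the loop head closes one iteration.
            if is_branch and taken and target == head:
                self.entries.setdefault(head, []).append(trace)
                self.building = (head, [])    # start capturing the next iteration
            return

        # A backward taken branch is treated as a loop-closing branch:
        # start capturing iterations of the loop whose head is the branch target.
        if is_branch and taken and target < pc and len(self.entries) < self.max_loops:
            self.building = (target, [])

    def lookup(self, head_pc, unroll=1):
        """On a hit, return up to `unroll` cached iterations as one long trace."""
        iterations = self.entries.get(head_pc)
        if not iterations:
            return None
        trace = []
        for it in iterations[:unroll]:
            trace.extend(it)
        return trace
```

Under this model, a hit on a cached loop head supplies whole (optionally unrolled) iterations without touching the instruction cache or the trace cache, which is the effect the abstract attributes to the proposed fetch engine.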

Similar resources

Parallélisme des nids de boucles pour l'optimisation du temps d'exécution et de la taille du code. (Nested loop parallelism to optimize execution time and code size)

Real-time implementation algorithms always include nested loops, which require significant execution time. Thus, several nested-loop parallelism techniques have been proposed with the aim of decreasing their execution times. These techniques can be classified by granularity into iteration-level parallelism and instruction-level parallelism. In the case of the instructio...

Predicting Unroll Factors Using Supervised Classification

Compilers base many critical decisions on abstracted architectural models. While recent research has shown that modeling is effective for some compiler problems, building accurate models requires a great deal of human time and effort. This paper describes how machine learning techniques can be leveraged to help compiler writers model complex systems. Because learning techniques can effectively ...
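This snippet refers to using supervised learning to choose unroll factors. As a minimal illustration of that general idea, not the cited paper's actual feature set, training data, or learner, a classifier mapping a few hand-picked loop features to an unroll factor might look like the following sketch (the features and example data here are invented):

```python
# Minimal sketch: predict a loop's unroll factor with a supervised classifier.
# The feature names, training rows and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [estimated trip count, body size in instructions,
#            number of memory ops, contains a branch (0/1)]
X_train = [
    [1000,    4,  1, 0],
    [1000,   40, 12, 1],
    [8,       6,  2, 0],
    [100000,  3,  1, 0],
]
y_train = [8, 1, 2, 8]   # "best" unroll factor observed for each training loop

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Predict an unroll factor for a new loop described by the same features.
print(model.predict([[5000, 5, 2, 0]]))
```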

Balancing Fine- and Medium-Grained Parallelism in Scheduling Loops for the XIMD Architecture

This paper presents an approach to scheduling loops that leverages the distinctive architectural features of the XIMD, particularly the variable number of instruction streams and low synchronization cost. The classical VLIW and MIMD architectures have a fixed number of instruction streams, each with a fixed width. A compiler for the XIMD architecture can exploit fine-grained parallelism within ...

Swing Modulo Scheduling

[19] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing.
[22] J. Wang and C. Eisenbeis. Decomposed software pipelining: A new approach to exploit instruction level parallelism for loops programs. In IFIP, January 1993.

Branch Prediction of Conditional Nested Loops through an Address Queue

Multi-dimensional applications, such as image processing and seismic analysis, usually require the optimized performance obtained from instruction-level parallelism. The critical sections of such applications consist of nested loops with the possibility of embedded conditional branch instructions. Branch prediction techniques usually require extra hardware, redundancy or do not guarantee the pr...
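The snippet is cut off before it describes the address-queue mechanism itself, so the following is only a loosely related, hypothetical sketch of predicting loop-closing branches from previously observed trip counts; it is not the paper's address-queue technique:

```python
# Hypothetical trip-count predictor for loop-closing branches (illustration only;
# this is not the address-queue scheme described in the paper).

class LoopBranchPredictor:
    def __init__(self):
        self.last_trip = {}   # branch PC -> trip count seen on the previous run of the loop
        self.current = {}     # branch PC -> executions of the branch seen so far this run

    def predict(self, pc):
        """Predict taken (stay in the loop) until the last observed trip count is reached."""
        seen = self.current.get(pc, 0)
        last = self.last_trip.get(pc)
        if last is None:
            return True               # no history: assume the loop keeps iterating
        return seen + 1 < last        # predict not-taken on the expected final iteration

    def update(self, pc, taken):
        """Record the actual outcome of the loop-closing branch."""
        self.current[pc] = self.current.get(pc, 0) + 1
        if not taken:                 # loop exited: remember how many times the branch ran
            self.last_trip[pc] = self.current[pc]
            self.current[pc] = 0
```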

Journal:
  • Computación y Sistemas

Volume 8, Issue -

Pages -

Publication date: 2004